Failover can occur in one of two situations: an entire node fails, or a resource on a node (such as a network interface) fails. Node failure is easy to identify in Cluster Administrator. When you are connected to a cluster, Cluster Administrator displays a list of all nodes and their current status. Resource failure that does not bring down a node, but does result in group failover, can be harder to detect. Use Cluster Administrator or another cluster management application to periodically check the owner of each cluster group to see whether the group has failed over.
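The owner check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any cluster management tool: the function name, the group names, and the node names are hypothetical, and the two dictionaries are assumed to have been collected by whatever management application you use.

```python
from typing import Dict, List


def find_failed_over_groups(expected_owner: Dict[str, str],
                            current_owner: Dict[str, str]) -> List[str]:
    """Return the groups whose current owner differs from the expected owner."""
    moved = []
    for group, expected in expected_owner.items():
        actual = current_owner.get(group)
        # A group missing from the snapshot is skipped rather than reported.
        if actual is not None and actual != expected:
            moved.append(group)
    return sorted(moved)


# Hypothetical snapshot: group name -> owning node.
expected = {"SQL": "NODE1", "FileShares": "NODE2"}
current = {"SQL": "NODE2", "FileShares": "NODE2"}
print(find_failed_over_groups(expected, current))  # prints ['SQL']
```

Running a comparison like this on a schedule would surface group failovers that did not bring down a node.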
It is important to use Cluster Administrator to routinely monitor the status of all clusters and check for failover activities that diminish either the performance or the availability of resources. Failover is supposed to be easily tolerated. Your workflow will not suffer if all applications are working, failover policies have been correctly set, and you have provided sufficient capacity for all situations. If any of these three criteria is not met, you must take action to meet it. For information on how to plan your server cluster, failover policies, and capacity requirements, see Planning your server cluster.
Serious failover situations can affect performance and availability. If a node fails and its companion node is unable to serve clients efficiently, the situation must be resolved immediately. If groups persistently fail over and fail back without surviving on either node for very long, availability to clients is completely lost.
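One way to recognize a group that persistently fails over and fails back is to count its ownership changes within a sliding time window. The following Python sketch assumes you have already collected the timestamps of a group's ownership changes; the function name, the window length, and the threshold are illustrative choices, not values defined by the Cluster service.

```python
from datetime import datetime, timedelta
from typing import List


def is_flapping(moves: List[datetime], window: timedelta, threshold: int) -> bool:
    """Return True if any window of the given length holds at least
    `threshold` ownership changes."""
    times = sorted(moves)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# Hypothetical history: three ownership changes within ten minutes.
base = datetime(2001, 9, 8, 15, 0, 0)
history = [base, base + timedelta(minutes=5), base + timedelta(minutes=10)]
print(is_flapping(history, window=timedelta(minutes=15), threshold=3))  # prints True
```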
When investigating possible problems with failovers, answer the following questions:
For example, other resources could be suffering due to the increased load placed on the alternate node.
Many types of failure can cause failovers. Usually, these occur at the operating system and hardware levels. For help with determining the exact type of a failure, see Troubleshooting.
To review what happened before and during a group failure, run Event Viewer and look at the application log, the system log, and (if applicable) the security log. This helps you determine what types of errors occurred.
If your performance has seriously suffered as a result of a failover, review the capacity planning issues presented in Capacity planning.
If you determine that the nodes did not efficiently transfer control of various groups and resources during failover, review the failover policies you have set. For example, the Cluster service could be attempting to bring a File Share resource online before the corresponding Physical Disk resource has been brought online. This type of error is easily fixed by configuring dependencies, so that the File Share resource depends on the Physical Disk resource.
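The effect of a dependency on the order in which resources come online can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) and only models the ordering idea; it is not how the Cluster service is implemented, and the `online_order` function and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def online_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which each resource follows the resources it depends on.

    `depends_on` maps a resource to the set of resources that must be online
    first. graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map: the File Share needs its Physical Disk first.
deps = {"File Share": {"Physical Disk"}, "Physical Disk": set()}
print(online_order(deps))  # prints ['Physical Disk', 'File Share']
```

Without the dependency entry, nothing in the map would constrain the order, which mirrors the misconfiguration described above.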
The Cluster service records error messages about resource failure in the system log, which you can view in Event Viewer. The Event Properties dialog box for an event shows the details of the error message. Information about a resource failure might appear as follows:
Date: 9/8/2001
Time: 3:22:49 PM
Type: Error
Source: ClusSvc
Category: Failover Mgr
Event ID: 1069
User: N/A
Computer: DTCENTER3
Description: "Cluster resource 'Disk V:' in Resource Group 'SQL' failed.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp."
Other applications can keep track of such errors and notify appropriate parties as necessary.
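As a rough sketch of how such an application might recognize the sample entry above, the following Python code parses an event that has been exported as plain text in the same "Field: value" layout. The layout, the field list, and the function names are assumptions made for illustration, and the event ID 1069 is taken from the sample rather than from a complete list of Cluster service events.

```python
from typing import Dict

# Field names as they appear in the sample entry.
KNOWN_FIELDS = {"Date", "Time", "Type", "Source", "Category",
                "Event ID", "User", "Computer", "Description"}

SAMPLE = """Date: 9/8/2001
Time: 3:22:49 PM
Type: Error
Source: ClusSvc
Category: Failover Mgr
Event ID: 1069
User: N/A
Computer: DTCENTER3
Description: "Cluster resource 'Disk V:' in Resource Group 'SQL' failed.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp."
"""


def parse_event(text: str) -> Dict[str, str]:
    """Parse 'Field: value' lines; unrecognized lines continue the previous field."""
    fields: Dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key in KNOWN_FIELDS:
            fields[key] = value.strip()
            last_key = key
        elif last_key is not None and line.strip():
            fields[last_key] += " " + line.strip()
    return fields


def is_resource_failure(event: Dict[str, str]) -> bool:
    """Match the source and event ID shown in the sample entry."""
    return event.get("Source") == "ClusSvc" and event.get("Event ID") == "1069"


event = parse_event(SAMPLE)
if is_resource_failure(event):
    # A real application would notify an operator here instead of printing.
    print(f"{event['Computer']}: {event['Description']}")
```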
For more information about diagnosing and resolving specific kinds of failures and errors, including failover problems, see Troubleshooting. For more information on resource dependencies, see Groups.